stdin And stdout

A Python script can read piped data from sys.stdin and write its results to sys.stdout, so scripts can be chained together in a shell pipeline:


In [1]:
# egrep.py
import sys, re

# sys.argv is the list of command-line arguments
# sys.argv[0] is the name of the program itself
# sys.argv[1] will be the regex specified at the command line
regex = sys.argv[1]

# for every line passed into the script
for line in sys.stdin:
    # if it matches the regex, write it to stdout
    if re.search(regex, line):
        sys.stdout.write(line)

In [9]:
# line_count.py
import sys

count = 0
for line in sys.stdin:
    count += 1

# print goes to sys.stdout
print(count)


0

In [7]:
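# Windows syntax; on Unix-like systems use cat instead of type.
# egrep.py and line_count.py must first be saved in the working directory.
!type SomeFile.txt | python egrep.py "[0-9]" | python line_count.py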


python: can't open file 'egrep.py': [Errno 2] No such file or directory
python: can't open file 'line_count.py': [Errno 2] No such file or directory
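The filter-then-count pipeline above can also be sketched as plain functions over any iterable of lines, which lets the logic run without a shell. This is only a sketch: the names `matching_lines` and `count_lines` and the sample text are assumptions for illustration, not part of the original scripts.

```python
import io
import re


def matching_lines(regex, lines):
    """Yield each line that contains a match for regex (like egrep.py)."""
    pattern = re.compile(regex)
    for line in lines:
        if pattern.search(line):
            yield line


def count_lines(lines):
    """Count the items in any iterable of lines (like line_count.py)."""
    return sum(1 for _ in lines)


# io.StringIO stands in for sys.stdin here; the sample text is made up.
fake_stdin = io.StringIO("line 1\nno digits here\nline 3\n")
digit_line_count = count_lines(matching_lines("[0-9]", fake_stdin))
print(digit_line_count)
```

Passing `sys.stdin` instead of `fake_stdin` would give the same behavior as the piped scripts.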

Reading Files

The Basics Of Text Files

Use the open function to open a file:


In [10]:
# 'r' means read-only
file_for_reading = open('reading_file.txt', 'r')

# 'w' is write—will destroy the file if it already exists!
file_for_writing = open('writing_file.txt', 'w')

# 'a' is append—for adding to the end of the file
file_for_appending = open('appending_file.txt', 'a')

# don't forget to close your files when you're done
file_for_reading.close()
file_for_writing.close()
file_for_appending.close()


---------------------------------------------------------------------------
FileNotFoundError                         Traceback (most recent call last)
<ipython-input-10-de2791de0a52> in <module>()
      1 # 'r' means read-only
----> 2 file_for_reading = open('reading_file.txt', 'r')
      3 
      4 # 'w' is write—will destroy the file if it already exists!
      5 file_for_writing = open('writing_file.txt', 'w')

FileNotFoundError: [Errno 2] No such file or directory: 'reading_file.txt'

Use a with block to ensure that files are closed:


In [15]:
with open('SomeFile.txt', 'r') as f:
    for line in f:
        print(line.strip())

# After with block, file is closed


line 1
line 2
line 3
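As a rough sketch of the 'w', 'a', and 'r' modes used with with blocks, the following writes a small file, appends to it, and reads it back. The file name, its contents, and the use of a temporary directory are assumptions chosen so that nothing real is overwritten.

```python
import os
import tempfile

# Work in a temporary directory; the file name and contents are made up.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "example.txt")

    # 'w' creates the file (or truncates it if it already exists)
    with open(path, "w") as f:
        f.write("# a comment\n")
        f.write("line 1\n")
        f.write("# another comment\n")

    # 'a' adds to the end of the file instead of truncating it
    with open(path, "a") as f:
        f.write("line 2\n")

    # 'r' reads; strip() removes the trailing newline from each line
    with open(path, "r") as f:
        lines = [line.strip() for line in f]

# Every with block has closed its file by this point.
starts_with_hash = sum(1 for line in lines if line.startswith("#"))
print(lines)
print(starts_with_hash)
```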

Delimited Files

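The csv module handles delimited files. csv.reader yields each row as a list of strings, and csv.DictReader uses the header row to yield each row as a dictionary. Both accept a delimiter argument for files that are not comma-separated.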


In [64]:
import csv

with open('stocks.csv', 'r') as f:
    reader = csv.reader(f, delimiter=',')
    for row in reader:
        date = row[0]
        symbol = row[1]
        closing_price = float(row[2])
        print(date, symbol, closing_price)


6/20/2014 AAPL 90.91
6/20/2014 MSFT 41.68
6/20/2014 FB 64.5

In [65]:
with open('stocks-headers.csv', 'r') as f:
    reader = csv.DictReader(f, delimiter=':')
    for row in reader:
        date = row['date']
        symbol = row['symbol']
        closing_price = float(row['closing_price'])
        print(date, symbol, closing_price)


6/20/2014 AAPL 90.91
6/20/2014 MSFT 41.68
6/20/2014 FB 64.5
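Writing delimited data works much the same way with csv.writer. The sketch below writes tab-delimited rows and reads them back with csv.DictReader; the in-memory buffer, the tab delimiter, and the `today_prices` name are assumptions for illustration, with the prices taken from the output above.

```python
import csv
import io

# Closing prices from the output above, keyed by symbol.
today_prices = {"AAPL": 90.91, "MSFT": 41.68, "FB": 64.5}

# io.StringIO stands in for a real file opened with newline=''.
buffer = io.StringIO()
writer = csv.writer(buffer, delimiter="\t")
writer.writerow(["symbol", "closing_price"])  # header row
for symbol, price in today_prices.items():
    writer.writerow([symbol, price])

# Rewind and read the same data back, using the header row as dict keys.
buffer.seek(0)
rows = list(csv.DictReader(buffer, delimiter="\t"))
parsed = {row["symbol"]: float(row["closing_price"]) for row in rows}
print(parsed)
```

Letting csv.writer build each line, rather than joining fields by hand, means fields that contain the delimiter are quoted correctly.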

HTML And The Parsing Thereof


In [45]:
some_html = """
<html>
<head>
<title>A web page</title>
</head>
<body>
<p id="author">Joel Grus</p>
<p id="subject" class="important">Data Science</p>
</body>
</html>
"""

In [46]:
from bs4 import BeautifulSoup
import requests

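# fetch a live page...
html = requests.get('http://www.example.com').text
# ...then parse the sample document above instead, so the outputs below are reproducible
html = some_html
# the 'html5lib' parser must be installed separately (pip install html5lib)
soup = BeautifulSoup(html, 'html5lib')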

In [47]:
first_paragraph = soup.find('p')
first_paragraph


Out[47]:
<p id="author">Joel Grus</p>

In [48]:
soup.p.text, soup.p.text.split()


Out[48]:
('Joel Grus', ['Joel', 'Grus'])

In [49]:
soup.p['id']


Out[49]:
'author'

In [50]:
soup.p.get('id')


Out[50]:
'author'

In [51]:
soup.find_all('p')


Out[51]:
[<p id="author">Joel Grus</p>,
 <p class="important" id="subject">Data Science</p>]

In [52]:
# soup('p') is shorthand for soup.find_all('p')
[p for p in soup('p') if p.get('id')]


Out[52]:
[<p id="author">Joel Grus</p>,
 <p class="important" id="subject">Data Science</p>]

In [53]:
soup('p', {'class' : 'important'})


Out[53]:
[<p class="important" id="subject">Data Science</p>]

In [54]:
# a string as the second argument is matched against the CSS class
soup('p', 'important')


Out[54]:
[<p class="important" id="subject">Data Science</p>]

In [56]:
[p for p in soup('p') if 'important' in p.get('class', [])]


Out[56]:
[<p class="important" id="subject">Data Science</p>]

Using APIs

JSON (And XML)

APIs transfer data in a structured format, usually JSON and sometimes XML.


In [59]:
import json

json_string = """{ "title" : "Data Science Book",
                   "author" : "Joel Grus",
                   "publicationYear" : 2014,
                   "topics" : [ "data", "science", "data science"] }"""

# parse the JSON into a Python dictionary
# (avoid naming the variable dict, which would shadow the built-in type)
deserialized = json.loads(json_string)
if 'data science' in deserialized['topics']:
    print(deserialized)


{'title': 'Data Science Book', 'author': 'Joel Grus', 'publicationYear': 2014, 'topics': ['data', 'science', 'data science']}
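Going the other way, json.dumps serializes a Python dictionary to a JSON string. A minimal round-trip sketch, assuming a record shaped like the one above (the variable names here are illustrative):

```python
import json

# A record shaped like the example above.
book = {"title": "Data Science Book",
        "author": "Joel Grus",
        "publicationYear": 2014,
        "topics": ["data", "science", "data science"]}

# Serialize to a JSON string; sort_keys makes the key order deterministic.
serialized = json.dumps(book, sort_keys=True)

# Parsing the string gives back an equal dictionary.
round_tripped = json.loads(serialized)
print(round_tripped == book)
```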

Using An Unauthenticated API


In [ ]:
endpoint = 'https://api.github.com/users/joelgrus/repos'
repos = json.loads(requests.get(endpoint).text)
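Once parsed, the response is an ordinary list of dictionaries, one per repository. The sketch below counts repositories by creation year and by language. To run offline it uses `sample_repos`, a hypothetical stand-in for `repos` with made-up values; the function names are also illustrative, and only the `created_at` and `language` fields of the real response are relied on.

```python
from collections import Counter
from datetime import datetime

# Hypothetical stand-in for the parsed response, so the sketch runs offline.
sample_repos = [
    {"name": "repo-a", "created_at": "2013-07-05T02:02:28Z", "language": "Python"},
    {"name": "repo-b", "created_at": "2013-11-15T05:33:22Z", "language": "Haskell"},
    {"name": "repo-c", "created_at": "2014-02-01T18:10:00Z", "language": "Python"},
]


def creation_years(repos):
    """Count repositories by the year in their 'created_at' timestamp."""
    dates = [datetime.strptime(repo["created_at"], "%Y-%m-%dT%H:%M:%SZ")
             for repo in repos]
    return Counter(date.year for date in dates)


def language_counts(repos):
    """Count repositories by their 'language' field (which may be None)."""
    return Counter(repo.get("language") for repo in repos)


year_counts = creation_years(sample_repos)
print(year_counts)
print(language_counts(sample_repos))
```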
